Improve benchmark scripts via trials #1908

charleskawczynski · 2024-07-30T18:50:02Z

It turns out that running multiple trials, as BenchmarkTools.@btime does, is important for making a fair comparison and reducing noise. We could use BenchmarkTools' @btime for the outer part and CUDA.@sync with an nreps loop for the inner part, but then we'd also need to interpolate a bunch of variables, so I just went with writing simple loops instead.
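
As a rough illustration of the "simple loops" approach, here is a minimal CPU-only sketch. The function name `time_per_call`, its keyword defaults, and the choice to report the minimum over trials are assumptions made for this example and are not taken from the benchmark scripts themselves.

```julia
# Hypothetical helper: names, defaults, and the use of the minimum over
# trials are illustrative assumptions, not the PR's actual code.
function time_per_call(f!; ntrials = 10, nreps = 100)
    f!()  # warm-up call so that compilation is not timed
    best = Inf
    for _ in 1:ntrials
        t0 = time_ns()
        # On a GPU, this inner loop would be wrapped in CUDA.@sync so that
        # asynchronous kernel launches finish before the clock is read.
        for _ in 1:nreps
            f!()
        end
        elapsed = (time_ns() - t0) / nreps  # nanoseconds per call
        best = min(best, elapsed)
    end
    return best
end

x = rand(Float32, 1024)
y = similar(x)
t = time_per_call(() -> (y .= x); ntrials = 5, nreps = 50)
println("time per call: ", t, " ns")
```

Averaging over `nreps` calls amortizes timer overhead, while repeating that measurement over several trials damps run-to-run noise.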

Here is a summary of the updated numbers:

Indexing & static ndranges

[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                       │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BSR.at_dot_call!(X_array, Y_array; nreps=1000, bm)                          │ 68 microseconds, 641 nanoseconds │ 14.4882 │ 295.415     │ 1              │ 1000   │
│ BSR.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                        │ 13 microseconds, 787 nanoseconds │ 72.1366 │ 1470.86     │ 1              │ 1000   │
│ iscpu || BSR.custom_sol_kernel!(X_vector, Y_vector, Val(N); nreps=1000, bm) │ 12 microseconds, 925 nanoseconds │ 76.943  │ 1568.87     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, us; nreps=1000, bm)               │ 13 microseconds, 364 nanoseconds │ 74.4195 │ 1517.41     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, uss; nreps=1000, bm)              │ 12 microseconds, 929 nanoseconds │ 76.9247 │ 1568.49     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=false, nreps=1000, bm)   │ 41 microseconds, 5 nanoseconds   │ 24.2533 │ 494.525     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=false, nreps=1000, bm)  │ 26 microseconds, 652 nanoseconds │ 37.3141 │ 760.835     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=true, nreps=1000, bm)    │ 13 microseconds, 582 nanoseconds │ 73.2243 │ 1493.04     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=true, nreps=1000, bm)   │ 12 microseconds, 922 nanoseconds │ 76.9613 │ 1569.24     │ 1              │ 1000   │
└─────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                       │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BSR.at_dot_call!(X_array, Y_array; nreps=1000, bm)                          │ 69 microseconds, 10 nanoseconds  │ 28.8217 │ 587.673     │ 1              │ 1000   │
│ BSR.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                        │ 28 microseconds, 219 nanoseconds │ 70.4848 │ 1437.18     │ 1              │ 1000   │
│ iscpu || BSR.custom_sol_kernel!(X_vector, Y_vector, Val(N); nreps=1000, bm) │ 25 microseconds, 460 nanoseconds │ 78.1221 │ 1592.91     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, us; nreps=1000, bm)               │ 25 microseconds, 625 nanoseconds │ 77.6194 │ 1582.66     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, uss; nreps=1000, bm)              │ 25 microseconds, 436 nanoseconds │ 78.1975 │ 1594.45     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=false, nreps=1000, bm)   │ 41 microseconds, 621 nanoseconds │ 47.7881 │ 974.4       │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=false, nreps=1000, bm)  │ 27 microseconds, 111 nanoseconds │ 73.3654 │ 1495.92     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=true, nreps=1000, bm)    │ 25 microseconds, 931 nanoseconds │ 76.703  │ 1563.97     │ 1              │ 1000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=true, nreps=1000, bm)   │ 25 microseconds, 464 nanoseconds │ 78.1095 │ 1592.65     │ 1              │ 1000   │
└─────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Index swapping

[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌──────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├──────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BIS.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                 │ 34 microseconds, 617 nanoseconds │ 57.4574 │ 1171.56     │ 2              │ 1000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=0, nreps=1000, bm) │ 60 microseconds, 384 nanoseconds │ 32.939  │ 671.627     │ 2              │ 1000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=1, nreps=1000, bm) │ 68 microseconds, 108 nanoseconds │ 29.2034 │ 595.458     │ 2              │ 1000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=2, nreps=1000, bm) │ 60 microseconds, 395 nanoseconds │ 32.9329 │ 671.502     │ 2              │ 1000   │
└──────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌──────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├──────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BIS.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                 │ 59 microseconds, 558 nanoseconds │ 66.791  │ 1361.87     │ 2              │ 1000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=0, nreps=1000, bm) │ 63 microseconds, 238 nanoseconds │ 62.905  │ 1282.63     │ 2              │ 1000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=1, nreps=1000, bm) │ 80 microseconds, 502 nanoseconds │ 49.4142 │ 1007.56     │ 2              │ 1000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=2, nreps=1000, bm) │ 63 microseconds, 228 nanoseconds │ 62.9142 │ 1282.82     │ 2              │ 1000   │
└──────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
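
The "achieved bw" and "bw %" columns are not defined in the output, but the numbers are consistent with the following reconstruction: bytes moved equals the number of elements times the element size times n-reads/writes, divided by the time per call and by 1024^3, then expressed as a percentage of device_bandwidth_GBs. The sketch below is an inferred formula rather than the script's actual code, and the function name `achieved_bandwidth` is hypothetical.

```julia
# Assumed reconstruction of the table's columns; the function name is hypothetical.
function achieved_bandwidth(dims, ::Type{T}, n_reads_writes, time_s) where {T}
    bytes = prod(dims) * sizeof(T) * n_reads_writes
    return bytes / time_s / 1024^3
end

dims = (63, 4, 4, 1, 5400)
device_bandwidth = 2039

# First Float32 row of the "Indexing & static ndranges" table:
# 68 microseconds, 641 nanoseconds with n-reads/writes = 1.
bw = achieved_bandwidth(dims, Float32, 1, 68.641e-6)
pct = 100 * bw / device_bandwidth

# First Float32 row of the "Index swapping" table:
# 34 microseconds, 617 nanoseconds with n-reads/writes = 2.
bw_swap = achieved_bandwidth(dims, Float32, 2, 34.617e-6)

println("achieved bw: ", round(bw; digits = 1), ", bw %: ", round(pct; digits = 1))
println("achieved bw (index swapping row): ", round(bw_swap; digits = 1))
```

Small differences from the table in the last digits are expected because the times shown are rounded to whole nanoseconds.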

Summary

Indexing & static ndranges:
- The very slow (array + dynamic size) kernel takes (strangely) about the same time for Float32 and Float64
- The SOL kernel (vector + static size) is (expectedly) 2x faster for Float32 than for Float64
- The array + fully static size kernel is 2x slower than SOL for Float32 and only about 6% slower than SOL for Float64
- The array + ("forced") linear indexing kernel is equivalent to SOL for both Float32 and Float64
Index swapping (all static sizes):
- The vector, no-swap kernel is about 1.7x faster for Float32 than for Float64 (34.6 vs. 59.6 microseconds), short of the expected 2x.
- The main takeaway: index swapping degrades performance much more for Float32 than for Float64.
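
To put numbers on the index-swapping comparison, the sketch below divides each custom_kernel_bc! time by the at_dot_call! time for the same float type. The times are copied from the index-swapping tables; the field names and the `slowdown` helper are made up for this example.

```julia
# Times per call in microseconds, copied from the index-swapping tables.
# Field names and the `slowdown` helper are hypothetical.
times = Dict(
    Float32 => (baseline = 34.617, swap0 = 60.384, swap1 = 68.108, swap2 = 60.395),
    Float64 => (baseline = 59.558, swap0 = 63.238, swap1 = 80.502, swap2 = 63.228),
)

# Ratio of each swap variant's time to the at_dot_call! baseline.
slowdown(t) = (t.swap0 / t.baseline, t.swap1 / t.baseline, t.swap2 / t.baseline)

for T in (Float32, Float64)
    println(T, ": ", round.(slowdown(times[T]); digits = 2))
end
```

A ratio near 1 means the kernel keeps pace with the baseline; larger values indicate a larger relative slowdown.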

charleskawczynski added the Performance monitoring 🔍🚀 label Jul 30, 2024

Improve benchmark scripts via trials

6ba11fd

charleskawczynski force-pushed the ck/upate_benchmarks branch from 969af73 to 6ba11fd Compare July 30, 2024 18:53

charleskawczynski merged commit e2d61b0 into main Jul 30, 2024
16 of 19 checks passed

charleskawczynski deleted the ck/upate_benchmarks branch July 30, 2024 20:35
